Project Overview & Objective:

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective

  • To predict whether a liability customer will buy a personal loan or not.
  • Which variables are most significant.
  • Which segment of customers should be targeted more.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
  • Securities_Account: Does the customer have securities account with the bank?
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
  • Online: Do customers use internet banking facilities?
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)?

Scoring Rubric:

  • Perform an Exploratory Data Analysis on the data: Univariate analysis, Bivariate analysis, Use appropriate visualizations to identify the patterns and insights, Any other exploratory deep dive (5 Points)
  • Illustrate the insights based on EDA: Key meaningful observations on the relationship between variables (3 Points)
  • Data Pre-processing: Prepare the data for analysis - Missing value Treatment, Outlier Detection(treat, if needed), Feature Engineering, Prepare data for modelling and check the split (6 Points)
  • Model building - Logistic Regression: Build the logistic regression model. Comment on model performance (4 Points)
  • Model performance evaluation and improvement: Comment on which metric is right for model performance evaluation and why? Can model performance be improved? If yes, then do it using appropriate techniques for logistic regression and comment on model performance after improvement (5 Points)
  • Model building - Decision Tree: Build the model and comment on the model performance. Identify the key variables that have a strong relationship with the dependent variable. Comment on model performance (4 Points)
  • Model performance evaluation and improvement: Try pruning technique(s). Evaluate the model on appropriate metric. Comment on model performance (5 Points)
  • Actionable Insights & Recommendations: Compare decision tree and Logistic regression. Conclude with the key takeaways for the marketing team - what would your advice be on how to do this campaign? (5 Points)
  • Notebook - Overall: Structure and flow - Well commented code (3 Points)
In [ ]:
# Importing the Necessary Libraries

import numpy as np   
import pandas as pd   

import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

!pip install nb-black
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nb-black
  Downloading nb_black-1.0.7.tar.gz (4.8 kB)
Requirement already satisfied: ipython in /usr/local/lib/python3.7/dist-packages (from nb-black) (7.9.0)
Collecting black>='19.3'
  Downloading black-22.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 4.6 MB/s 
Requirement already satisfied: tomli>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from black>='19.3'->nb-black) (2.0.1)
Collecting pathspec>=0.9.0
  Downloading pathspec-0.10.1-py3-none-any.whl (27 kB)
Collecting click>=8.0.0
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 2.4 MB/s 
Collecting platformdirs>=2
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Requirement already satisfied: typing-extensions>=3.10.0.0 in /usr/local/lib/python3.7/dist-packages (from black>='19.3'->nb-black) (4.1.1)
Collecting mypy-extensions>=0.4.3
  Downloading mypy_extensions-0.4.3-py2.py3-none-any.whl (4.5 kB)
Collecting typed-ast>=1.4.2
  Downloading typed_ast-1.5.4-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (843 kB)
     |████████████████████████████████| 843 kB 23.6 MB/s 
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from click>=8.0.0->black>='19.3'->nb-black) (4.13.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->click>=8.0.0->black>='19.3'->nb-black) (3.10.0)
Requirement already satisfied: pexpect in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (4.8.0)
Collecting jedi>=0.10
  Downloading jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
     |████████████████████████████████| 1.6 MB 31.6 MB/s 
Requirement already satisfied: pickleshare in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (0.7.5)
Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (2.6.1)
Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (4.4.2)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (5.1.1)
Requirement already satisfied: backcall in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (0.2.0)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (2.0.10)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.7/dist-packages (from ipython->nb-black) (57.4.0)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /usr/local/lib/python3.7/dist-packages (from jedi>=0.10->ipython->nb-black) (0.8.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->nb-black) (0.2.5)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->nb-black) (1.15.0)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.7/dist-packages (from pexpect->ipython->nb-black) (0.7.0)
Building wheels for collected packages: nb-black
  Building wheel for nb-black (setup.py) ... done
  Created wheel for nb-black: filename=nb_black-1.0.7-py3-none-any.whl size=5297 sha256=b834be72fc73dd020e3904ce42b1a36a9d93156c5e9b98636c4642420de70e9a
  Stored in directory: /root/.cache/pip/wheels/1e/b2/88/51c66d23ea5fd0d40ed50997555e15d981d92671376a9a412a
Successfully built nb-black
Installing collected packages: typed-ast, platformdirs, pathspec, mypy-extensions, jedi, click, black, nb-black
  Attempting uninstall: click
    Found existing installation: click 7.1.2
    Uninstalling click-7.1.2:
      Successfully uninstalled click-7.1.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.1.3 which is incompatible.
Successfully installed black-22.10.0 click-8.1.3 jedi-0.18.1 mypy-extensions-0.4.3 nb-black-1.0.7 pathspec-0.10.1 platformdirs-2.5.2 typed-ast-1.5.4
In [ ]:
# Loading the Dataset

pd.set_option('display.max_columns', None)

from google.colab import files
data_to_load = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving Loan_Modelling.csv to Loan_Modelling.csv
In [ ]:
import io
df = pd.read_csv(io.BytesIO(data_to_load['Loan_Modelling.csv']))

EDA and Data Pre-processing

In [ ]:
# Looking at the Shape of the Dataset

print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
There are 5000 rows and 14 columns.
In [ ]:
# Taking an initial look

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
In [ ]:
# Looking at the types in the Dataset

df.dtypes
Out[ ]:
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIPCode                 int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal_Loan           int64
Securities_Account      int64
CD_Account              int64
Online                  int64
CreditCard              int64
dtype: object

We have 14 columns in the dataset and 5,000 rows. 13 columns have datatype int and 1 column (CCAvg) has datatype float.

In [ ]:
# Confirming df has no null

df.isnull().values.any()
Out[ ]:
False
In [ ]:
# Since missing values can also be 0's in the dataset I'm checking for abnormal amounts of 0's in each of the columns

for column_name in df.columns:
    column = df[column_name]
    # Get the count of Zeros in column 
    count = (column == 0).sum()
    print('Count of zeros in column ', column_name, ' is : ', count)
Count of zeros in column  ID  is :  0
Count of zeros in column  Age  is :  0
Count of zeros in column  Experience  is :  66
Count of zeros in column  Income  is :  0
Count of zeros in column  ZIPCode  is :  0
Count of zeros in column  Family  is :  0
Count of zeros in column  CCAvg  is :  106
Count of zeros in column  Education  is :  0
Count of zeros in column  Mortgage  is :  3462
Count of zeros in column  Personal_Loan  is :  4520
Count of zeros in column  Securities_Account  is :  4478
Count of zeros in column  CD_Account  is :  4698
Count of zeros in column  Online  is :  2016
Count of zeros in column  CreditCard  is :  3530
In [ ]:
# Taking a look at the first 10 rows

df.head(10)
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
5 6 37 13 29 92121 4 0.4 2 155 0 0 0 1 0
6 7 53 27 72 91711 2 1.5 2 0 0 0 0 1 0
7 8 50 24 22 93943 1 0.3 3 0 0 0 0 0 1
8 9 35 10 81 90089 3 0.6 2 104 0 0 0 1 0
9 10 34 9 180 93023 1 8.9 3 0 1 0 0 0 0
In [ ]:
# Taking a look at the last 10 rows

df.tail(10)
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4990 4991 55 25 58 95023 4 2.00 3 219 0 0 0 0 1
4991 4992 51 25 92 91330 1 1.90 2 100 0 0 0 0 1
4992 4993 30 5 13 90037 4 0.50 3 0 0 0 0 0 0
4993 4994 45 21 218 91801 2 6.67 1 0 0 0 0 1 0
4994 4995 64 40 75 94588 3 2.00 3 0 0 0 0 1 0
4995 4996 29 3 40 92697 1 1.90 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.40 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.30 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.50 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.80 1 0 0 0 0 1 1

The 0's present in the data-set don't seem to be abnormal because it makes sense for the context of the columns they are in (Experience, CCAvg, Mortage, PersonalLoan, Securities_Account, CD_Account, Online, CreditCard) since the values will be 0 if no / N/A.

In [ ]:
# Checking for dulicates in df

df.duplicated().sum()
Out[ ]:
0

There are no duplicate entries in df

Univariate Analysis

In [ ]:
# Checking important information of the dataframe columns

df.describe(include='all').T
Out[ ]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
In [ ]:
# Checking for the number of unique values in each of the columns / what the values are

print(df.ID.value_counts())
print(df.Age.value_counts())
print(df.Experience.value_counts())
print(df.Income.value_counts())
print(df.ZIPCode.value_counts())
print(df.Family.value_counts())
print(df.CCAvg.value_counts())
print(df.Education.value_counts())
print(df.Mortgage.value_counts())
print(df.Personal_Loan.value_counts())
print(df.Securities_Account.value_counts())
print(df.CD_Account.value_counts())
print(df.Online.value_counts())
print(df.CreditCard.value_counts())
1       1
3331    1
3338    1
3337    1
3336    1
       ..
1667    1
1666    1
1665    1
1664    1
5000    1
Name: ID, Length: 5000, dtype: int64
35    151
43    149
52    145
54    143
58    143
50    138
41    136
30    136
56    135
34    134
39    133
57    132
59    132
51    129
45    127
60    127
46    127
42    126
31    125
40    125
55    125
29    123
62    123
61    122
44    121
32    120
33    120
48    118
38    115
49    115
47    113
53    112
63    108
36    107
37    106
28    103
27     91
65     80
64     78
26     78
25     53
24     28
66     24
67     12
23     12
Name: Age, dtype: int64
 32    154
 20    148
 9     147
 5     146
 23    144
 35    143
 25    142
 28    138
 18    137
 19    135
 26    134
 24    131
 3     129
 16    127
 14    127
 30    126
 17    125
 34    125
 27    125
 22    124
 29    124
 7     121
 6     119
 15    119
 8     119
 10    118
 13    117
 33    117
 11    116
 37    116
 36    114
 21    113
 4     113
 31    104
 12    102
 38     88
 2      85
 39     85
 1      74
 0      66
 40     57
 41     43
-1      33
-2      15
 42      8
-3       4
 43      3
Name: Experience, dtype: int64
44     85
38     84
81     83
41     82
39     81
       ..
202     2
203     2
189     2
224     1
218     1
Name: Income, Length: 162, dtype: int64
94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94087      1
91024      1
93077      1
94598      1
Name: ZIPCode, Length: 467, dtype: int64
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64
0.30    241
1.00    231
0.20    204
2.00    188
0.80    187
       ... 
3.25      1
3.67      1
4.67      1
8.90      1
2.75      1
Name: CCAvg, Length: 108, dtype: int64
1    2096
3    1501
2    1403
Name: Education, dtype: int64
0      3462
98       17
119      16
89       16
91       16
       ... 
547       1
458       1
505       1
361       1
541       1
Name: Mortgage, Length: 347, dtype: int64
0    4520
1     480
Name: Personal_Loan, dtype: int64
0    4478
1     522
Name: Securities_Account, dtype: int64
0    4698
1     302
Name: CD_Account, dtype: int64
1    2984
0    2016
Name: Online, dtype: int64
0    3530
1    1470
Name: CreditCard, dtype: int64
In [ ]:
# Installiing Pandas Profiling

!pip install -U pandas_profiling
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: pandas_profiling in /usr/local/lib/python3.7/dist-packages (1.4.1)
Collecting pandas_profiling
  Downloading pandas_profiling-3.4.0-py2.py3-none-any.whl (315 kB)
     |████████████████████████████████| 315 kB 5.2 MB/s 
Collecting requests<2.29,>=2.24.0
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.1 MB/s 
Requirement already satisfied: tqdm<4.65,>=4.48.2 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (4.64.1)
Collecting htmlmin==0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Requirement already satisfied: pydantic<1.11,>=1.8.1 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (1.10.2)
Requirement already satisfied: missingno<0.6,>=0.4.2 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (0.5.1)
Requirement already satisfied: matplotlib<3.6,>=3.2 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (3.2.2)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (2.11.3)
Collecting statsmodels<0.14,>=0.13.2
  Downloading statsmodels-0.13.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
     |████████████████████████████████| 9.9 MB 39.0 MB/s 
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (0.11.2)
Collecting multimethod<1.10,>=1.4
  Downloading multimethod-1.9-py3-none-any.whl (10 kB)
Collecting visions[type_image_path]==0.7.5
  Downloading visions-0.7.5-py3-none-any.whl (102 kB)
     |████████████████████████████████| 102 kB 55.4 MB/s 
Collecting phik<0.13,>=0.11.1
  Downloading phik-0.12.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (690 kB)
     |████████████████████████████████| 690 kB 58.7 MB/s 
Requirement already satisfied: pandas!=1.4.0,<1.6,>1.1 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (1.3.5)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (6.0)
Requirement already satisfied: numpy<1.24,>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (1.21.6)
Requirement already satisfied: scipy<1.10,>=1.4.1 in /usr/local/lib/python3.7/dist-packages (from pandas_profiling) (1.7.3)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.5->pandas_profiling) (2.6.3)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.5->pandas_profiling) (22.1.0)
Collecting tangled-up-in-unicode>=0.0.4
  Downloading tangled_up_in_unicode-0.2.0-py3-none-any.whl (4.7 MB)
     |████████████████████████████████| 4.7 MB 43.5 MB/s 
Collecting imagehash
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
     |████████████████████████████████| 296 kB 50.9 MB/s 
Requirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.7.5->pandas_profiling) (7.1.2)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2<3.2,>=2.11.1->pandas_profiling) (2.0.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib<3.6,>=3.2->pandas_profiling) (1.4.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib<3.6,>=3.2->pandas_profiling) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib<3.6,>=3.2->pandas_profiling) (3.0.9)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib<3.6,>=3.2->pandas_profiling) (2.8.2)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib<3.6,>=3.2->pandas_profiling) (4.1.1)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas!=1.4.0,<1.6,>1.1->pandas_profiling) (2022.5)
Requirement already satisfied: joblib>=0.14.1 in /usr/local/lib/python3.7/dist-packages (from phik<0.13,>=0.11.1->pandas_profiling) (1.2.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.1->matplotlib<3.6,>=3.2->pandas_profiling) (1.15.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<2.29,>=2.24.0->pandas_profiling) (2.10)
Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.7/dist-packages (from requests<2.29,>=2.24.0->pandas_profiling) (2.1.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<2.29,>=2.24.0->pandas_profiling) (2022.9.24)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<2.29,>=2.24.0->pandas_profiling) (1.24.3)
Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.7/dist-packages (from statsmodels<0.14,>=0.13.2->pandas_profiling) (0.5.3)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.7/dist-packages (from statsmodels<0.14,>=0.13.2->pandas_profiling) (21.3)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.7/dist-packages (from imagehash->visions[type_image_path]==0.7.5->pandas_profiling) (1.3.0)
Building wheels for collected packages: htmlmin
  Building wheel for htmlmin (setup.py) ... done
  Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27098 sha256=9259de8c665e21b57d7878e25650cb9e6a794aad6a8b1ab80666fb04e184dc23
  Stored in directory: /root/.cache/pip/wheels/70/e1/52/5b14d250ba868768823940c3229e9950d201a26d0bd3ee8655
Successfully built htmlmin
Installing collected packages: tangled-up-in-unicode, multimethod, visions, imagehash, statsmodels, requests, phik, htmlmin, pandas-profiling
  Attempting uninstall: statsmodels
    Found existing installation: statsmodels 0.12.2
    Uninstalling statsmodels-0.12.2:
      Successfully uninstalled statsmodels-0.12.2
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: pandas-profiling
    Found existing installation: pandas-profiling 1.4.1
    Uninstalling pandas-profiling-1.4.1:
      Successfully uninstalled pandas-profiling-1.4.1
Successfully installed htmlmin-0.1.12 imagehash-4.3.1 multimethod-1.9 pandas-profiling-3.4.0 phik-0.12.2 requests-2.28.1 statsmodels-0.13.5 tangled-up-in-unicode-0.2.0 visions-0.7.5
In [ ]:
# Importing ProfileReport

from pandas_profiling import ProfileReport
In [ ]:
# Generating a Pandas ProfileReport to gain some more initial insights

df.profile_report()
Out[ ]:

Initial Observations:

  • There are 5,000 unique values for ID which is the same number of rows in the data-set. This is intuitive/makes sense because every customer would have a unique ID number.
  • There are 45 unique values in the Age column and they range from 23-67. Age has a mean of 45.3384 and median of 45. Q1 is 35, Q3 is 55, and the IQR is 20. The mean and median are significantly close to each other and the skewness value is -0.02934068151 so the Age column is slightly left-skewed but not very much.
  • Experience has 47 unique values which range from -3 to 43, with a mean of 20.1046 and a median of 20. The negative values don't make much sense so I will treat the values later. Q1 is 10, Q3 is 30, and the IQR is 20. The skewness value is -0.0263246884 so the Experience column is also slightly left-skewed but not by very much.
  • Income has 163 unique values which range from 8 to 224. It has a mean of 73.7742 and a median of 64, and it is right-skewed with a skewness of 0.8413386073. Q1 is 39, Q3 is 98, and the IQR is 59.
  • ZIPCode has 467 unique values with a minimum of 90005 and a maximum of 96651. We will treat this column later.
  • Family has 4 unique values: 1 (most common; 1472 values), 2 (next most common; 1296 values), 3 (next most common; 1,222 values), then 4 (least common; 10100).
  • CCAvg has 108 unique values which range from 0 to 10. It has a mean of 1.937938 and a median of 1.5. Q1 is 0.7, Q3 is 2.5, and IQR is 1.8. It has a skewness value of 1.598443337 and is significantly right-skewed.
  • Education has 3 distinct values: 1 (Undergrad: 2096 values), 2 (Graduate: 1501 values), and 3 (Advanced/Professional: 1403 values).
  • Mortage has 347 unique values which range from 0 to 635. There are many 0's in the column (3462) which indicates many people do not have a House Mortgage. Q1 is 0, Q3 is 101, and the IQR is 101. The mean is 56.4988, the median is 0, and Mortage is right-skewed with a skewness value of 2.104002319.
  • Personal_Loan has two values 0 (most common: 4520 values) and 1 (least common: 480 values) which indicates the number of people who did not accept the loan in the last campaign is greater than the number of people who did.
  • Securities_Account has two values 0 (most common: 4478 values) and 1 (least common: 522 values) which indicates that there are a greater number of customers who do not have a securities account with the bank than the number of customers who do.
  • CD_Account has two values 0 (most common: 4698 values) and 1 (least common: 302 values) which indicates that there are a greater number of customers who do not have a Certificate of Deposit with the bank than the number of customers who do.
  • Online has two values 1 (most common: 2984 values) and 0 (least common: 2016 values) which indicates that the number of customers who use internet banking facilities is larger than the number of customers who don't.
  • CreditCard has two values 0 (most common: 3530 values) and 1 (least common: 1470 values) which indicates that the number of customers who don't use a credit card issued by any other bank (excluding All life Bank) is greater than the number of customers who do.

More Observations:

  • Age has a high correlation with Experience. (This makes sense because the older you are, the more experience you are likely to have)
  • Income has a high correlation with CCAvg, Mortgage, and PersonalLoan (This makes sense because a customer with a higher income is likely to have higher monthly credit card expenses, a higher value for their house mortgage, and will be more likely to accept a personal loan than a customer with a lower income)
  • CCAvg has a high correlation with PersonalLoan (This makes sense because a person who is making more expenditures on their credit card is more likely to accept a personal loan).
  • As seen above, the target variable (PersonalLoan) is highly correlated with Income and CCAvg which makes sense. (Someone with a higher income and monthly credit card expenditure is more likely to accept a personal loan than someone with a lower income and monthly credit card expenditure)

Looking more closely/visualizing at each of the individual columns (except for customer ID since we know it is unique and uniform)

Age Column

In [ ]:
sns.histplot(x=df["Age"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f0ee250>
In [ ]:
sns.boxplot(x=df["Age"], showfliers=True, fliersize=5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f1a9b90>

The values for Age look reasonable and there doesn't appear to be any outliers.

Experience Column

In [ ]:
sns.histplot(x=df["Experience"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7eef8c50>
In [ ]:
sns.boxplot(x=df["Experience"], showfliers=True, fliersize=5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f0ecf10>

The values for Experience also look reasonable but there are some negative values which don't make sense (as seen earlier) so I will treat those.

In [ ]:
df.describe()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.467954 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 -3.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000

We see a confirmation of this since the minimum value for Experience is -3

In [ ]:
# Setting negative Experiences values to zero

df.loc[df.Experience < 0, 'Experience'] = 0
In [ ]:
# Checking our changes

df.describe()
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.119600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.440484 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 0.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000
In [ ]:
# Confirming our changes by visualizing the updated Histogram

sns.histplot(x=df["Experience"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7d401290>

We see the negative values have been fixed and set to zero.

Income Column

In [ ]:
sns.histplot(x=df["Income"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7d41df90>
In [ ]:
sns.boxplot(x=df["Income"], showfliers=True, fliersize=5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7d39a890>

The values for Income seem reasonable but the distribution is right-skewed due to the presence of some higher income values. Some outliers exist on the right-side but the presence of higher incomes makes sense and seems important to incorporate in the models so we will not remove these.

ZIPCode Column

In [ ]:
sns.histplot(x=df["ZIPCode"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7d22f7d0>
In [ ]:
sns.boxplot(x=df["ZIPCode"], showfliers=True, fliersize=5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7d23db50>

The values in the ZIPCode column are reasonable. I was having trouble with the USZipCode library so I decided the leave the values are they are and not make them a categorical variable either since there are so many unique ZIPCode values.

Family Column

In [ ]:
sns.histplot(x=df["Family"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7d159090>
In [ ]:
sns.countplot(data=df, x='Family');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1, 2, 3]), <a list of 4 Text major ticklabel objects>)

There are 4 distinct values for family-size (1, 2, 3, 4) which makes sense. As we saw earlier, 1 is the most frequent, 4 is the next most frequent, then 2, and finally 3.

CreditCard Average Column

In [ ]:
sns.histplot(x=df["CCAvg"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7cfe0350>
In [ ]:
sns.boxplot(x=df["CCAvg"], showfliers=True, fliersize=5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f2ce4d0>

The values for average credit card expenses per month make sense. The distribution is right-skewed due to the presence of customers who have significantly higher spending habits and some of these are shown as outliers in the box-plots, but these points make sense and seem important to include in the model so I will not remove them.

Education Column

In [ ]:
sns.histplot(x=df["Education"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f69c150>
In [ ]:
sns.countplot(data=df, x='Education');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1, 2]), <a list of 3 Text major ticklabel objects>)

There are 3 distinct values for Education 1: Undergrad, 2: Graduate, 3: Advanced/Professional. As we saw earlier, Undergrad (1) has the most frequency, Advanced/Professional (3) is next, then finally Graduate (2).

Mortgage Column

In [ ]:
sns.histplot(x=df["Mortgage"], kde = True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f888a50>
In [ ]:
sns.boxplot(x=df["Mortgage"], showfliers=True, fliersize=5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7f4ec2d0>

The distribution for Mortgage is right-skewed due to the presence of many 0 values. We saw earlier that this makes sense because that indicates some customers do not have a house mortgage. We also see the skewness is due to the presence of customers with very high values for their house mortgages. These values are shown as outliers in the box-plot but the values make sense and seem important/significant to include in the model so I will not remove them.

Personal Loan Column (Target Variable)

In [ ]:
sns.histplot(x=df["Personal_Loan"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c804bdb10>
In [ ]:
sns.countplot(data=df, x='Personal_Loan');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1]), <a list of 2 Text major ticklabel objects>)

There are two distinct values for Personal Loan: 0 (the customer did not accept the personal loan offered in the last campaign) and 1 (the customer did accept the personal loan offered in the last campaign). There are more 0 values indicating more customers did not accept the personal loan. This will be our target variable in the models.

Securities Account Column

In [ ]:
sns.histplot(x=df["Securities_Account"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c80535250>
In [ ]:
sns.countplot(data=df, x='Securities_Account');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1]), <a list of 2 Text major ticklabel objects>)

There are two distinct values for Securities Account: 0 (the customer does not have a securities account with the bank) and 1 (the customer does have a securities account with the bank). There are more 0 values which indicates that more customers do not have securities accounts with the bank and these values make sense.

CD Account Column Column

In [ ]:
sns.histplot(x=df["CD_Account"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c808c4710>
In [ ]:
sns.countplot(data=df, x='CD_Account');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1]), <a list of 2 Text major ticklabel objects>)

CD_Account has two distinct values: 0 (the customer does not have a certificate of deposit (CD) account with the bank) and 1 (the customer does have a certificate of deposit (CD) account with the bank). There are more 0 values indicating there are more customers who do not have a a certificate of deposit (CD) account with the bank. One of the business goals is to convert liability customers to personal loan customers while retaining them as depositors so we'll need to keep that in mind.

Online Column

In [ ]:
sns.histplot(x=df["Online"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c82dbd390>
In [ ]:
sns.countplot(data=df, x='Online');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1]), <a list of 2 Text major ticklabel objects>)

There are two distinct values for Online: 0 (the customer does not use internet banking facilities) and 1 (the customer does use internet banking facilities). There are more 1 valus indicating more customers use internet banking facilities than those who don't. This makes sense and will be useful information to know when giving insights to the business.

Credit Card Column

In [ ]:
sns.histplot(x=df["CreditCard"])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c83a17950>
In [ ]:
sns.countplot(data=df, x='CreditCard');
plt.xticks(rotation=90)
Out[ ]:
(array([0, 1]), <a list of 2 Text major ticklabel objects>)

There are two distinct values for CreditCard: 0 (the customer does not use a credit card issued by any other Bank) and 1 (the customer does use a credit card issued by another bank). There are more 0 values than 1's, indicating that more customers do not use a credit card issued by another bank than the number of customers that do.

Observations

  • No null/missing values to treat
  • Detected outliers in a few columns but the values make sense to include in the models so we will not treat them
  • No duplicate entries in data

Next:

  • Bivariate Analysis
  • Prepare Data for Modeling and check the split

Bivariate Analysis

In [ ]:
# Creating a PairPlot 

sns.pairplot(df,diag_kind='kde')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f2c839441d0>
In [ ]:
# Creating a heatmat to observe correlations between the columns

plt.figure(figsize = (16,8))
sns.heatmap(data=df.corr(), annot=True);

We see a confirmation of what we saw in the Pandas ProfileReport:

  • Age has a high correlation with Experience. (This makes sense because the older you are, the more experience you are likely to have)
  • Income has a high correlation with CCAvg, Mortgage, and PersonalLoan (This makes sense because a customer with a higher income is likely to have higher monthly credit card expenses, a higher value for their house mortgage, and will be more likely to accept a personal loan than a customer with a lower income)
  • CCAvg has a high correlation with PersonalLoan (This makes sense because a person who is making more expenditures on their credit card is more likely to accept a personal loan).
  • The target variable (PersonalLoan) is highly correlated with Income and CCAvg which makes sense. (Someone with a higher income and monthly credit card expenditure is more likely to accept a personal loan than someone with a lower income and monthly credit card expenditure)

Looking closer at each of these relationships through visualizations

Age and Experience

In [ ]:
sns.lineplot(data = df , x = 'Age' , y = 'Experience');
plt.xticks(rotation=90);
In [ ]:
sns.scatterplot(data=df, x='Age', y='Experience',hue="Personal_Loan");
plt.xticks(rotation=90);

plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
Out[ ]:
<matplotlib.legend.Legend at 0x7f2c7a61dfd0>

Age and Experience appear to have a mostly strong, direct linear relationship

Income and CCAvg

In [ ]:
sns.lineplot(data = df , x = 'Income' , y = 'CCAvg');
plt.xticks(rotation=90);
In [ ]:
sns.scatterplot(data=df, x='Income', y='CCAvg',hue="Personal_Loan");
plt.xticks(rotation=90);

plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
Out[ ]:
<matplotlib.legend.Legend at 0x7f2c7a50ea90>

As Income increases, CCAvg seems to increase but from the scatterplot we see that is not always the case. We also see that people with higher incomes seem to be more likely to have accepted a personal loan which makes sense.

Income and Mortgage

In [ ]:
sns.lineplot(data = df , x = 'Income' , y = 'Mortgage');
plt.xticks(rotation=90);
In [ ]:
sns.scatterplot(data=df, x='Income', y='Mortgage',hue="Personal_Loan");
plt.xticks(rotation=90);

plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
Out[ ]:
<matplotlib.legend.Legend at 0x7f2c7a4555d0>

As Income increases, Mortgage values seem to increase too which makes sense. But from the scatterplot we see higher income values with a lower corresponding Mortgage value, this makes sense because people with higher incomes are able to pay off their Mortgage and bring the values down/close to (if not) zero. We see the highest income values have Mortgage values of 0 which confirms this.

Personal Loan and Income

In [ ]:
sns.scatterplot(data=df, x='Personal_Loan', y='Income');
plt.xticks(rotation=90);

plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
WARNING:matplotlib.legend:No handles with labels found to put in legend.
Out[ ]:
<matplotlib.legend.Legend at 0x7f2c7a3d8cd0>

People within the middle range of Income appear to be more likely to accept a personal loan than people on the upper or lower bounds of the Income range. This makes sense because people with a higher income have no need for a loan, whereas people in the lower range probably do not want to take a large financial risk.

Personal Loan and CCAvg

In [ ]:
sns.scatterplot(data=df, x='Personal_Loan', y='CCAvg');
plt.xticks(rotation=90);

plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
WARNING:matplotlib.legend:No handles with labels found to put in legend.
Out[ ]:
<matplotlib.legend.Legend at 0x7f2c7a349890>

People with higher CCAvg values are more likely to accept a personal loan than people with lower monthly credit card averages. People with low-middle CCAvg valus also seem to accept personal loans.

Preparing and Checking the Split

In [ ]:
# Preparing the Split

X = df.drop('Personal_Loan',axis=1)    
Y = df['Personal_Loan']  

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
In [ ]:
# Checking the Split

print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))
70.00% data is in training set
30.00% data is in test set
In [ ]:
# Checking the values of Personal Loan in all 3 datasets

print("Original Personal_Loan True Values    : {0} ({1:0.2f}%)".format(len(df.loc[df['Personal_Loan'] == 1]), (len(df.loc[df['Personal_Loan'] == 1])/len(df.index)) * 100))
print("Original Personal_Loan False Values   : {0} ({1:0.2f}%)".format(len(df.loc[df['Personal_Loan'] == 0]), (len(df.loc[df['Personal_Loan'] == 0])/len(df.index)) * 100))
print("")
print("Training Personal_Loan True Values    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Personal_Loan False Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Personal_Loan True Values        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Personal_Loan False Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
Original Personal_Loan True Values    : 480 (9.60%)
Original Personal_Loan False Values   : 4520 (90.40%)

Training Personal_Loan True Values    : 331 (9.46%)
Training Personal_Loan False Values   : 3169 (90.54%)

Test Personal_Loan True Values        : 149 (9.93%)
Test Personal_Loan False Values       : 1351 (90.07%)

Model Evaluation Critereon:

Wrong Predictions the Model can make:

  • Case 1: A customer is identified/classified as likely to buy a personal loan, when in reality they will not buy a loan.
  • Case 2: A customer is identified/classified as not likely to buy a personal loan, when in reality they would buy a loan.

Which loss is greater?

  • While the overall business goal of the bank is to convert liability customers to personal loan customers, since the objective of the model is to help the marketing department identify potential loan customers, failing to identify potential customers to advertise to is a greater loss (Case 2).
  • Since the objective of the model is not loan approval and is just about finding potential customers to advertise to incase they are interested, Case 1, where customers identified as likely to buy a loan when they will not is not as big of a loss (customers will see the advertisement but simply not choose to proceed).
  • Therefore the case of False Negatives (Case 2) is a greater loss because customers who might be interested in buying a personal loan will not be advertised to because they will be classified as such.

How to reduce the loss?

  • Since we are trying to reduce loss in terms of the amount of False Negative predictions, we need to focus on maximizing the Recall value (TP / (TP + FN)).

Logistic Regression

In [ ]:
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,
    make_scorer,
)
In [159]:
# Function to check model performance using metrics

def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # predicting using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    pred = np.round(pred_thres)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
In [160]:
# Function to plot the Confusion Matrix with threshold as an argument 

def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    y_pred = np.round(pred_thres)

    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [162]:
# Creating the model

model = LogisticRegression(solver="newton-cg", random_state=1)
logistic = model.fit(x_train, y_train)
/usr/local/lib/python3.7/dist-packages/scipy/optimize/linesearch.py:478: LineSearchWarning: The line search algorithm did not converge
  warn('The line search algorithm did not converge', LineSearchWarning)
/usr/local/lib/python3.7/dist-packages/scipy/optimize/linesearch.py:327: LineSearchWarning: The line search algorithm did not converge
  warn('The line search algorithm did not converge', LineSearchWarning)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/optimize.py:212: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations.
  ConvergenceWarning,

Checking performance on the training set

In [163]:
# Creating confusion matrix

confusion_matrix_sklearn_with_threshold(logistic, x_train, y_train)
In [186]:
log_reg_model_train = model_performance_classification_sklearn_with_threshold(
    logistic, x_train, y_train
)

print("Training performance:")
log_reg_model_train
Training performance:
Out[186]:
Accuracy Recall Precision F1
0 0.950857 0.625378 0.811765 0.706485

Checking performance on the test set

In [166]:
# Creating confusion matrix

confusion_matrix_sklearn_with_threshold(logistic, x_test, y_test)
In [185]:
log_reg_model_test = model_performance_classification_sklearn_with_threshold(
    logistic, x_test, y_test
)

print("Training performance:")
log_reg_model_test
Training performance:
Out[185]:
Accuracy Recall Precision F1
0 0.946667 0.57047 0.841584 0.68

We see we have a very poor recall score for both models which can be improved.

Improving the model using AUC-ROC curve

In [168]:
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
    precision_recall_curve,
    roc_curve,
)
In [169]:
# Training Set

logit_roc_auc_train = roc_auc_score(y_train, logistic.predict_proba(x_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, logistic.predict_proba(x_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
In [172]:
# Test Set

logit_roc_auc_test = roc_auc_score(y_test, logistic.predict_proba(x_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, logistic.predict_proba(x_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
In [173]:
# Setting the Optimal threshold according to the AUC-ROC curve (where tpr is high and fpr is low)

fpr, tpr, thresholds = roc_curve(y_train, logistic.predict_proba(x_train)[:, 1])

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.07542686103654546
In [174]:
# Plotting confusion matrix of the model with the threshold modified for the training set

confusion_matrix_sklearn_with_threshold(
    logistic, x_train, y_train, threshold=optimal_threshold_auc_roc
)
In [175]:
# checking model performance for training set

log_reg_model_train_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    logistic, x_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_threshold_auc_roc
Training performance:
Out[175]:
Accuracy Recall Precision F1
0 0.872571 0.915408 0.42025 0.576046
In [177]:
# Plotting confusion matrix of the model with the threshold modified for the test set

confusion_matrix_sklearn_with_threshold(
    logistic, x_test, y_test, threshold=optimal_threshold_auc_roc
)
In [176]:
# checking model performance for test set

log_reg_model_test_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    logistic, x_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_test_threshold_auc_roc
Training performance:
Out[176]:
Accuracy Recall Precision F1
0 0.886 0.892617 0.461806 0.608696

We see that the recall score has improved significantly by setting the optimal threshold according to AUC-ROC curve. Let's see if we can improve it throuugh the precision-recall curve to find a better threshold value.

Using precision-recall curve to modify the model and find a better threshold value

In [178]:
y_scores = logistic.predict_proba(x_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

We see that around 0.25 we get a higher recall score and good precision score so we will use this as our threshold. We would get equal precision and recall scores with a threshold of around 0.32.

In [179]:
# Setting the threshold value

optimal_threshold_curve = 0.25
In [189]:
# Checking the scores of the new model for the training set

log_reg_model_train_threshold_curve = model_performance_classification_sklearn_with_threshold(
    logistic, x_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_threshold_curve
Training performance:
Out[189]:
Accuracy Recall Precision F1
0 0.940286 0.785498 0.653266 0.713306
In [180]:
# Plotting a confusion matrix for the training set

confusion_matrix_sklearn_with_threshold(
    logistic, x_train, y_train, threshold=optimal_threshold_curve
)
In [183]:
# Checking the scores of the new model for the test set

log_reg_model_test_threshold_curve = model_performance_classification_sklearn_with_threshold(
    logistic, x_train, y_train, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_threshold_curve
Test performance:
Out[183]:
Accuracy Recall Precision F1
0 0.940286 0.785498 0.653266 0.713306
In [181]:
# Plotting a confusion matrix for the test set

confusion_matrix_sklearn_with_threshold(
    logistic, x_test, y_test, threshold=optimal_threshold_curve
)

We see that although our accuracy score has improved, our recall score has gotten worse (which is what we are focused on in this situation). Therefore, we can conclude that the previous model (Model 2) has the best performance of the Logistic Regression models but let's compare them to be sure.

In [190]:
# Training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train.T,
        log_reg_model_train_threshold_auc_roc.T,
        log_reg_model_train_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Initial Model",
    "AUC-ROC Curve",
    "0.25 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[190]:
Initial Model AUC-ROC Curve 0.25 Threshold
Accuracy 0.950857 0.872571 0.940286
Recall 0.625378 0.915408 0.785498
Precision 0.811765 0.420250 0.653266
F1 0.706485 0.576046 0.713306
In [193]:
# Testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test.T,
        log_reg_model_test_threshold_auc_roc.T,
        log_reg_model_test_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Initial Model",
    "AUC-ROC Curve",
    "0.25 Threshold",
]

print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[193]:
Initial Model AUC-ROC Curve 0.25 Threshold
Accuracy 0.946667 0.886000 0.940286
Recall 0.570470 0.892617 0.785498
Precision 0.841584 0.461806 0.653266
F1 0.680000 0.608696 0.713306

We see a confirmation that we get the best recall score on both our training and test sets with Model 2 (where we set the optimal threshold according to the AUC-ROC Curve. (training recall score: 0.915408, test recall score: 0.892617)

In [194]:
# Finding the coefficients and calculating the odds

log_odds = logistic.coef_[0]
pd.DataFrame(log_odds, x_train.columns, columns=["coef"]).T
Out[194]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Securities_Account CD_Account Online CreditCard
coef -0.000026 -0.095126 0.098961 0.051284 -0.000113 0.700465 0.172168 1.654058 0.000755 -0.645125 3.001081 -0.513886 -0.846786
In [197]:
# Converting coefficients to odds

odds = np.exp(logistic.coef_[0])

# Finding the percentage change and adding to a dataframe

perc_change_odds = (np.exp(logistic.coef_[0]) - 1) * 100 
pd.set_option("display.max_columns", None)
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=x_train.columns).T
Out[197]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Securities_Account CD_Account Online CreditCard
Odds 0.999974 0.909258 1.104023 1.052622 0.999887 2.014689 1.187877 5.228153 1.000755 0.524597 20.107256 0.598167 0.428791
Change_odd% -0.002643 -9.074186 10.402291 5.262188 -0.011310 101.468882 18.787700 422.815279 0.075539 -47.540281 1910.725580 -40.183312 -57.120899

Decision Tree

In [ ]:
# Function to calculate different metrics and check model performance

def model_performance_classification_sklearn(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
# Function to plot the Confusion Matric with Percentages

def confusion_matrix_sklearn(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
# Building the model and fitting it to the training set

model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(x_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In [ ]:
# Checking the Accuracy of the Models fit to the training and test sets

print("Accuracy on training set : ", model.score(x_train, y_train))
print("Accuracy on test set : ", model.score(x_test, y_test))
Accuracy on training set :  1.0
Accuracy on test set :  0.9793333333333333
In [ ]:
#Checking number of positives in Personal Loan
Y.sum(axis = 0)
Out[ ]:
480

Out of 5,000 values, 480 are positive, if we simply marked every value as negative we would get an accuracy of 90.4% which is better than the test set's accuracy but we also need to look at recall and other metrics.

In [ ]:
# Checking Recall, Precision, and F1 score of the training set

decision_tree_perf_train = model_performance_classification_sklearn(
    model, x_train, y_train
)
decision_tree_perf_train
Out[ ]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
# Checking Recall, Precision, and F1 score of the test set

decision_tree_perf_test = model_performance_classification_sklearn(
    model, x_test, y_test
)
decision_tree_perf_test
Out[ ]:
Accuracy Recall Precision F1
0 0.979333 0.892617 0.898649 0.895623

We see our model is slightly overfit to the training data so we'll need to fix this later.

In [ ]:
# Confusion Matrix of Training Set

confusion_matrix_sklearn(model, x_train, y_train)
In [ ]:
# Confusion Matrix of Test Set

confusion_matrix_sklearn(model, x_test, y_test)

Visualizing the Decision Tree

In [ ]:
feature_names = list(X.columns)
print(feature_names)
['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [ ]:
plt.figure(figsize=(20,30))
tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [ ]:
# Checking the Gini importance of each feature

print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.402286
Income              0.304042
Family              0.157297
CCAvg               0.053165
CD_Account          0.024352
ID                  0.017656
Experience          0.017286
ZIPCode             0.011810
Age                 0.009880
Mortgage            0.002224
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
In [ ]:
# Visualizing these values

importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='pink', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
  • By far, the most important features for predicting whether a customer is likely to accept a Personal Loan or not are Education, Income, then Family size.
  • The next most important features are CCAvg, CD_Account, ID, Experience, and ZIP Code.
  • We saw that the model is overfitting and that our tree is very complex so we'll now try to fix this and check the new model's metrics & performance.

Reducing over-fitting by setting the tree's max-depth to 4

In [ ]:
# Making a new Decision Tree Model with a max depth of 4

model2 = DecisionTreeClassifier(criterion = 'gini',max_depth=4,random_state=1)
model2.fit(x_train, y_train)
Out[ ]:
DecisionTreeClassifier(max_depth=4, random_state=1)
In [ ]:
# Checking the Accuracy of the New Models fit to the training and test sets

print("Accuracy on training set : ", model2.score(x_train, y_train))
print("Accuracy on test set : ", model2.score(x_test, y_test))
Accuracy on training set :  0.9885714285714285
Accuracy on test set :  0.9813333333333333
In [150]:
# Checking Recall, Precision, and F1 score of the training set

decision_tree2_perf_train = model_performance_classification_sklearn(
    model2, x_train, y_train
)
decision_tree2_perf_train
Out[150]:
Accuracy Recall Precision F1
0 0.988571 0.912387 0.964856 0.937888
In [ ]:
# Confusion Matrix of Training Set

confusion_matrix_sklearn(model2, x_train, y_train)
In [151]:
# Checking Recall, Precision, and F1 score of the test set

decision_tree2_perf_test = model_performance_classification_sklearn(
    model2, x_test, y_test
)
decision_tree2_perf_test
Out[151]:
Accuracy Recall Precision F1
0 0.981333 0.865772 0.941606 0.902098
In [ ]:
# Confusion Matrix of Test Set

confusion_matrix_sklearn(model2, x_test, y_test)
In [ ]:
# Visualizing the Tree

plt.figure(figsize=(15,10))

tree.plot_tree(model2,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()

We see that over-fitting has decreased with the less complex model (test and train values are closer together), and our recall score has improved slightly. Let's see if hyperparameter tuning uses GridSearch helps more.

Hyperparameter Tuning using GridSearch

In [ ]:
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from

parameters = {
    "max_depth": np.arange(1,10),
    'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(x_train, y_train)
Out[ ]:
DecisionTreeClassifier(criterion='entropy', max_depth=4,
                       min_impurity_decrease=1e-06, random_state=1)
In [ ]:
# Accuracy of the new model

print("Accuracy on training set : ",estimator.score(x_train, y_train))
print("Accuracy on test set : ",estimator.score(x_test, y_test))
Accuracy on training set :  0.9888571428571429
Accuracy on test set :  0.98
In [152]:
# Checking Recall, Precision, and F1 score of the Training set

decision_tree3_perf_train = model_performance_classification_sklearn(
    estimator, x_train, y_train
)
decision_tree3_perf_train
Out[152]:
Accuracy Recall Precision F1
0 0.988857 0.912387 0.967949 0.939347
In [ ]:
# Confusion Matrix of Training Set

confusion_matrix_sklearn(estimator, x_train, y_train)
In [154]:
# Checking Recall, Precision, and F1 score of the test set

decision_tree3_perf_test = model_performance_classification_sklearn(
    estimator, x_test, y_test
)
decision_tree3_perf_test
Out[154]:
Accuracy Recall Precision F1
0 0.98 0.852349 0.940741 0.894366
In [ ]:
# Confusion Matrix of Test Set

confusion_matrix_sklearn(estimator, x_test, y_test)
In [ ]:
# Visualizing the Tree

plt.figure(figsize=(15,10))

tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
  • We see that this model is also not nearly as complex as the first one and therefore overfits less, but we also see our recall score has gotten worse from the second tree made with a max-depth of 4.
  • We will now try Cost-Complexity pruning.

Cost Complexity Pruning

In [ ]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [ ]:
# Visualizing Effectivee Alpha vs Total Leaf Impurity

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

As expected, as effective alpha increases the total impurity of leaves does as well.

In [ ]:
# Training the Decision Tree using Effective Alphas

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
In [ ]:
# Function to get Recall score for training set

recall_train=[]
for clf in clfs:
    pred_train3=clf.predict(x_train)
    values_train=metrics.recall_score(y_train,pred_train3)
    recall_train.append(values_train)
In [ ]:
# Function to get Recall score for test set

recall_test=[]
for clf in clfs:
    pred_test3=clf.predict(x_test)
    values_test=metrics.recall_score(y_test,pred_test3)
    recall_test.append(values_test)
In [ ]:
# Plotting Recall vs Alpha for the new model

fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [ ]:
# Creating the model with the highest train and test recall

index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)
In [148]:
# Checking training set performance for the new model

decision_tree4_perf_train = model_performance_classification_sklearn(
    best_model, x_train, y_train
)
decision_tree4_perf_train
Out[148]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [ ]:
# Plotting training set confusion matrix for the new model

confusion_matrix_sklearn(best_model, x_train, y_train)
In [149]:
# Checking testing set performance for the new model

decision_tree4_perf_test = model_performance_classification_sklearn(
    best_model, x_test, y_test
)
decision_tree4_perf_test
Out[149]:
Accuracy Recall Precision F1
0 0.979333 0.892617 0.898649 0.895623
In [ ]:
# Plotting testing set confusion matrix for the new model

confusion_matrix_sklearn(best_model, x_test, y_test)
In [ ]:
# Visualizing the Tree

plt.figure(figsize=(20,30))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()

We see that while the recall score has improved from Models 2 and 3, this model is extremely complex and does overfit.

Comparing the training and testing perfomance of all of the models

In [156]:
# Comparing the training set performance of all of the Decision Tree models

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree2_perf_train.T,
        decision_tree3_perf_train.T,
        decision_tree_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree 1 sklearn",
    "Decision Tree (Max Depth: 4))",
    "Decision Tree (Hypertuned)",
    "Decision Tree (Post-Pruning)"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[156]:
Decision Tree 1 sklearn Decision Tree (Max Depth: 4)) Decision Tree (Hypertuned) Decision Tree (Post-Pruning)
Accuracy 0.980000 0.988571 0.980000 0.980000
Recall 0.852349 0.912387 0.852349 0.852349
Precision 0.940741 0.964856 0.940741 0.940741
F1 0.894366 0.937888 0.894366 0.894366
In [158]:
# Comparing the testing set performance of all of the Decision Tree models

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree2_perf_test.T,
        decision_tree3_perf_test.T,
        decision_tree4_perf_test.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree 1 sklearn",
    "Decision Tree (Max Depth: 4))",
    "Decision Tree (Hypertuned)",
    "Decision Tree (Post-Pruning)"
]
print("Test set performance comparison:")
models_train_comp_df
Test set performance comparison:
Out[158]:
Decision Tree 1 sklearn Decision Tree (Max Depth: 4)) Decision Tree (Hypertuned) Decision Tree (Post-Pruning)
Accuracy 0.981333 0.981333 0.980000 0.979333
Recall 0.865772 0.865772 0.852349 0.892617
Precision 0.941606 0.941606 0.940741 0.898649
F1 0.902098 0.902098 0.894366 0.895623

We see that we get the best recall score for our training data with Model 2 (which had a max-depth of 4). But the best recall score for our test data comes from our 4th model created with Post-Pruning. The 4th model has a higher test recall score than its training recall score (which is the metric important in this situation) and that it's accuracy remains consistently high so we can conclude that it is the best model to use for this situation despite its complexity.

Logistic Regression and Decision Tree Model Comparison

  • We see that our best Logistic Regression model with the highest recall score (made with the optimal threshold from ROC-AUC) has a training recall score or 0.915408 and a test recall score of 0.892617.
  • Meanwhile, our best Decision Tree model with the highest recall score (using post-pruning) has a training recall score of 0.852349 and a test recall score of 0.892617.
  • Therefore, we can conclude that for this situation where having a high recall score is the main priority, the Logistic Regression model made with the optimal threshold set according to ROC-AUC is the best option over the other Logistic Regression models and any of the Decision Tree models.

Additional Visualizations to Illustrate Key Takeaways

In [200]:
# Function to create stacked bar-plots

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [203]:
stacked_barplot(df, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
In [207]:
sns.histplot(data=df, x="Income", hue="Personal_Loan")
Out[207]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c79f7b890>
In [208]:
stacked_barplot(df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
In [209]:
sns.histplot(data=df, x="CCAvg", hue="Personal_Loan")
Out[209]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c79ef3950>
In [210]:
stacked_barplot(df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
In [212]:
sns.histplot(data=df, x="Experience", hue="Personal_Loan")
Out[212]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c76f74d90>
In [214]:
sns.histplot(data=df, x="ZIPCode", hue="Personal_Loan")
Out[214]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c799a8fd0>
In [215]:
sns.histplot(data=df, x="Age", hue="Personal_Loan")
Out[215]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c7960d250>
In [216]:
sns.histplot(data=df, x="Mortgage", hue="Personal_Loan")
Out[216]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2c79d24390>

Key Takeaways and Business Insights

When creating a marketing campaign, it is important to know who the target demographic is. Based on the findings from the EDA and models, here are some key takeaways/insights for the marketing team to inform their campaign decisions:

  • Education is the most important factor in determining whether or not a person will accept a personal loan. People who have a Graduate or Advanced/Professional level of education or more likely to buy a personal loan than customers who only have an Undergraduate level of education.
  • Income is the second biggest deciding factor in determining whether or not a person will accept a personal loan. Customers with higher incomes tend to be more likely to buy a personal loan than those with lower incomes.
  • Family size is the third biggest deciding factor in determining whether or not a person will accept a personal loan. Customers with larger family sizes (sizes 3 or 4) or more likely to buy a personal loan than customers with a lower family size (1 or 2).
  • Monthly credit card expenses is the fourth biggest deciding factor in determining whether or not a person will accept a personal loan. People with the highest monthly credit card expenses are less likely to accept a personal loan, customers with low to moderately high credit card expenses are more likely to buy a loan.
  • CD_Account is the fifth biggest deciding factor in determining whether or not a person will accept a personal loan. Customers with a CD account are more likely to buy a personal loan than customers without one.
  • Experience is the sixth biggest deciding factor in determining whether or not a person will accept a personal loan. People of all experience levels have a likelihood of buying a loan, but as the experience levels increase towards the highest experience values, the likelihood slowly decreases.
  • Zip Code is the sixth biggest deciding factor in determining whether or not a person will accept a personal loan. Customers from most zip code ranges have a likelihood of purchasing a personal loan but some zip code ranges have a higher likelyhood of buying a personal loan.
  • Age is the seventh biggest deciding factor in determining whether or not a person will accept a personal loan. People of all ages have a likelihod of buying a loan, but people under 25 have a zero/very small likelyhood of buying a loan.
  • Mortgage is the eighth biggest deciding factor in determining whether or not a person will accept a personal loan. People with 0 (or no mortgage) values have a high likelihood of buying a personal loan (presumably because they want to purchase their first house).
In [ ]: